Alejandro Schuler
2022
Adapted from Steve Bagley and based on R for Data Science by Hadley Wickham
Edited by David Connell
By the end of these slides you should be able to…

If you haven't already, please open RStudio on DataHub by clicking this link. If you're viewing this on bCourses, you'll have to right click and then choose “Open Link in New Tab”.
You will get more out of this tutorial if you try out these things in R yourself!!
The R console window is the left (or lower-left) window in RStudio.
> 1 + 2
[1] 3
3 is the answer.
[1] means: the answer is a vector (a list of elements of the same type) and this line starts with the first element of that vector.
> 1 +2
> 1+ 2
> 1+2
> 1 + 2
These all do the same thing. The result of each line is 3:
[1] 3
> 1 + 2 * 3 # R respects order of operations
[1] 7
> 3/4
[1] 0.75
> 6^3
[1] 216
> log(10) # natural log
[1] 2.302585
> log10(10) # log base 10
[1] 1
> sqrt(16)
[1] 4
> c(2.1, -4, 22)
[1] 2.1 -4.0 22.0
Create a vector with the c( ) function, which is short for “combine”.
> 1:50
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
: is a handy shortcut to create a vector that is a sequence of integers from the first number to the second number (inclusive).
Note the [ ] notation. The second output line starts with [26] because its first value, 26, is the 26th element of the vector.
An operation is elementwise (or element-wise) if it acts on each element of a vector separately, producing a vector with the same length as the original.
The code below multiplies each element of 1:10 by the corresponding
element of 1:10, that is, it squares each element.
> (1:10)*(1:10)
[1] 1 4 9 16 25 36 49 64 81 100
> (1:10)^2
[1] 1 4 9 16 25 36 49 64 81 100
: has a higher precedence than addition +.
> 1 + 0:10
[1] 1 2 3 4 5 6 7 8 9 10 11
> 0:10 + 1 # which operator gets executed first?
[1] 1 2 3 4 5 6 7 8 9 10 11
> (0:10) + 1
[1] 1 2 3 4 5 6 7 8 9 10 11
> 0:(10 + 1)
[1] 0 1 2 3 4 5 6 7 8 9 10 11
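The same kind of check works for other operators. The sketch below is an illustration added here, and the particular expressions are assumptions chosen for the example rather than part of the original slides. It shows how : ranks against ^, * and unary minus.

```r
# ^ binds tighter than :, so this is 1:(3^2)
a <- 1:3^2
print(a)   # 1 2 3 4 5 6 7 8 9

# : binds tighter than *, so this is (1:3) * 2
b <- 1:3 * 2
print(b)   # 2 4 6

# unary minus binds tighter than :, so this is (-1):3
d <- -1:3
print(d)   # -1 0 1 2 3

# parentheses make the intended grouping explicit
e <- -(1:3)
print(e)   # -1 -2 -3
```

When in doubt, add parentheses; they cost nothing and make the intent obvious to the reader.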
> x <- 10
> x
[1] 10
> x / 5
[1] 2
/ is the division operator.
> x <- 10
> x
[1] 10
> x <- x + 1
> x
[1] 11
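Choose meaningful variable names; don't use x and y everywhere.
Don't make names too long either (e.g. Main.database.first.object.header.length).
See ?make.names for the complete rules on what can be a name.
Names are case-sensitive:
> a <- 1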
> A # this causes an error because A does not have a value
Error: object 'A' not found
> my_age_end_of_year = 31
> this_year = 2022
> my_birth_year = this_year - my_age_end_of_year
> my_birth_year
[1] 1991
Source: OOMPH course PHW251 - R for Public Health
> sqrt(2)
[1] 1.414214
> sqrt(0:10)
[1] 0.000000 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[9] 2.828427 3.000000 3.162278
> x <- 4
> sqrt(x)
[1] 2
> x
[1] 4
> y <- sqrt(x)
> y
[1] 2
> x <- 10
> y
[1] 2
What is the value of y after changing the value of x?
Calling a function does not change its input (x remains the same after sqrt(x)).
Once you assign a variable (y), it keeps its value until updated, even if you change other variables (x) that went into the original assignment of that variable.
> sum
Type sum, then hit the TAB key (or just wait a second).
RStudio lists the names that start with sum.
Press RETURN or ENTER to select the current item.
Type ?name for help on name. Example:
> ?log
This shows the documentation for the log function (and related functions) in the Help pane, including the name and meaning of the arguments and returned values.
> weights <- c(1.1, 2.2, 3.3)
> # this divides the weights, element-wise, by the conversion factor:
> weights / 2.2
[1] 0.5 1.0 1.5
> shoesize <- c(9, 12, 6, 10, 10, 16, 8, 4)
> shoesize
[1] 9 12 6 10 10 16 8 4
> sum(shoesize)
[1] 75
> sum(shoesize)/length(shoesize)
[1] 9.375
> mean(shoesize)
[1] 9.375
> x <- c(7, 3, 1, 9)
Subtract the mean of x from x, and then sum the result.
> x <- c(7, 3, 1, 9)
> mean(x)
[1] 5
> x - mean(x)
[1] 2 -2 -4 4
> sum(x - mean(x)) # answer in one expression
[1] 0
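The same pattern of an elementwise operation followed by a summary extends naturally. As a hedged sketch that is not part of the original exercise, the sample variance can be built by hand from the deviations and compared with R's built-in var().

```r
x <- c(7, 3, 1, 9)

deviations <- x - mean(x)      # elementwise subtraction: 2 -2 -4 4
squared    <- deviations^2     # elementwise squaring:    4  4 16 16

# sample variance: sum of squared deviations divided by (n - 1)
manual_var <- sum(squared) / (length(x) - 1)
print(manual_var)

# the built-in function gives the same value
print(var(x))
```

Each intermediate object is itself a vector, so you can print it to check every step.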
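In the RStudio script editor, type an expression such as factorial(1:10).
To run the current line, press Command-RETURN (Mac), or Ctrl-ENTER (Windows).
See the Code menu for other commands.
> # This is a comment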
> 1 + 2 # add some numbers
[1] 3
Use # to start a comment.
If you're working in R locally (installed on your computer), you will need to install the tidyverse package. If you're on DataHub it has already been installed.
> install.packages("tidyverse")
> library("tidyverse")
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✓ tibble 3.1.5 ✓ dplyr 1.0.7
✓ tidyr 1.1.4 ✓ stringr 1.4.0
✓ readr 2.0.2 ✓ forcats 0.5.1
✓ purrr 0.3.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
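If a script may run both locally and on DataHub, one option is to install a package only when it is missing. The helper below is a hypothetical convenience function written for this sketch; the name install_if_missing is an assumption, not part of R or tidyverse. It is demonstrated on the stats package, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is not already available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # report (invisibly) whether the package is available now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" is always installed, so this call returns TRUE without installing anything
available <- install_if_missing("stats")
print(available)
```

For this tutorial you would call install_if_missing("tidyverse") once and then load the package as usual.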
Put library("tidyverse") at the top of every script file.
A data frame is one of the most powerful features in R.
> mtc
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
A tibble is a kind of data frame. This one has 32 rows and 11 columns. We only see the first 10 rows because of limited slide/screen space.
The <dbl> under each column name means double-precision floating point number, which is a computer science term for any number with a decimal point in it (e.g. 1.3333, 3.14159, 1.0).
The mtc data frame was read in with:
> mtc = read_csv("https://tinyurl.com/mtcars-csv")
read_csv (from the readr package, part of tidyverse) reads in data frames that are stored in .csv files (.csv = comma-separated values).
To read a file saved on your computer, pass its path instead of a URL: read_csv("path/to/file/mtcars.csv")
Use ?read_csv to learn a bit more.
.csv files can also be exported from spreadsheets and databases and then saved locally to be read into R.
Use tibble() to make your own data frames from scratch in R.
> my_data = tibble( # newlines don't do anything, just increase code readability
+ mrn = c(1, 2, 3, 4),
+ age = c(33, 48, 8, 29)
+ )
> my_data
# A tibble: 4 × 2
mrn age
<dbl> <dbl>
1 1 33
2 2 48
3 3 8
4 4 29
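To see how CSV text maps onto a data frame without needing a file or a URL, read_csv can parse a literal string wrapped in I(). This is a minimal sketch assuming readr 2.0 or later (the slides show readr 2.0.2); the names csv_text and my_data_from_csv are made up for illustration.

```r
library(readr)

# The same four patients as above, written as CSV text:
# a header line, then one line per row, values separated by commas
csv_text <- "mrn,age
1,33
2,48
3,8
4,29"

# I() tells read_csv that the string is the data itself, not a file path
my_data_from_csv <- read_csv(I(csv_text), show_col_types = FALSE)
print(my_data_from_csv)
```

The result has the same two columns and four rows as the data frame built with tibble().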
dim() gives the dimensions of the data frame. ncol() and nrow() give you the number of columns and the number of rows, respectively.
> dim(my_data)
[1] 4 2
> ncol(my_data)
[1] 2
> nrow(my_data)
[1] 4
names() gives you the names of the columns (a vector)
> names(my_data)
[1] "mrn" "age"
glimpse() shows you a lot of information: the number of rows and columns, plus each column's name, type, and first few values
> glimpse(my_data)
Rows: 4
Columns: 2
$ mrn <dbl> 1, 2, 3, 4
$ age <dbl> 33, 48, 8, 29
head() returns the first n rows
> head(my_data, n=2)
# A tibble: 2 × 2
mrn age
<dbl> <dbl>
1 1 33
2 2 48
The rest of this section shows the basic data frame functions (“verbs”) in the dplyr package (part of tidyverse). Each operation takes a data frame and produces a new data frame.
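filter() picks out rows according to specified conditions
select() picks out columns according to their names
arrange() sorts the rows by values in some column(s)
mutate() creates new columns, often based on operations on other columns
All verbs work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the column names (without quotes).
The result is a new data frame.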
Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see the basics of how these verbs work.
> filter(mtc, mpg >= 25)
# A tibble: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
2 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
3 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
4 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
5 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
6 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
> filter(mtc, mpg >= 25, qsec < 19)
# A tibble: 4 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
3 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
> filter(mtc, mpg > 60)
# A tibble: 0 × 11
# … with 11 variables: mpg <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>, drat <dbl>,
# wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>
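== tests for equality (do not use =, which is for assignment)
The operators > and < test for greater-than and less-than
The operators >= and <= test for greater-than-or-equal and less-than-or-equal
Comparisons are elementwise and return a vector of TRUE/FALSE values:
> c(1,5,-22,4) > 0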
[1] TRUE TRUE FALSE TRUE
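filter() keeps the rows for which its conditions are TRUE. The sketch below uses a small made-up data frame (the name toy and its values are assumptions for illustration, not rows from mtc) to show that conditions separated by commas behave like a logical AND, and that | expresses OR.

```r
library(dplyr)

# Small made-up data frame (illustrative values only)
toy <- tibble(
  name = c("a", "b", "c", "d"),
  mpg  = c(32, 27, 15, 21),
  qsec = c(19.5, 18.9, 14.6, 20.0)
)

# Conditions separated by commas must all be TRUE for a row to be kept ...
both_comma <- filter(toy, mpg >= 25, qsec < 19)

# ... which is the same as joining them with & (logical AND)
both_and <- filter(toy, mpg >= 25 & qsec < 19)

# | keeps rows where at least one condition is TRUE (logical OR)
either <- filter(toy, mpg >= 25 | qsec < 19)

print(both_comma)   # only row "b" passes both tests
print(either)       # rows "a", "b" and "c" pass at least one test
```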
Which cars have horsepower (hp) greater than 200?
> filter(mtc, hp > 200)
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
2 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
3 10.4 8 460 215 3 5.42 17.8 0 0 3 4
4 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
5 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
6 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
7 15 8 301 335 3.54 3.57 14.6 0 1 5 8
> select(mtc, mpg, qsec, wt)
# A tibble: 32 × 3
mpg qsec wt
<dbl> <dbl> <dbl>
1 21 16.5 2.62
2 21 17.0 2.88
3 22.8 18.6 2.32
4 21.4 19.4 3.22
5 18.7 17.0 3.44
6 18.1 20.2 3.46
7 14.3 15.8 3.57
8 24.4 20 3.19
9 22.8 22.9 3.15
10 19.2 18.3 3.44
# … with 22 more rows
select() can also be used to select everything except for certain columns by using the minus character -
> select(mtc, -hp)
# A tibble: 32 × 10
mpg cyl disp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 3.9 2.62 16.5 0 1 4 4
2 21 6 160 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
select() has a friend called pull() which returns a vector instead of a (one-column) data frame
> select(mtc, hp)
# A tibble: 32 × 1
hp
<dbl>
1 110
2 110
3 93
4 110
5 175
6 105
7 245
8 62
9 95
10 123
# … with 22 more rows
> pull(mtc, hp)
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
[20] 65 97 150 150 245 175 66 91 113 264 175 335 109
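The difference matters when the next step expects a plain vector, as mean() and sum() do. A minimal sketch, using a made-up three-row data frame (the name toy and its values are assumptions for illustration):

```r
library(dplyr)

# Small made-up data frame (illustrative values only)
toy <- tibble(
  hp  = c(110, 93, 175),
  mpg = c(21, 22.8, 18.7)
)

hp_column <- select(toy, hp)   # still a data frame, with one column
hp_vector <- pull(toy, hp)     # a plain numeric vector

print(is.data.frame(hp_column))   # TRUE
print(is.numeric(hp_vector))      # TRUE

# vector functions work directly on the pulled values
mean_hp <- mean(hp_vector)
print(mean_hp)                    # 126
```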
> filter(mtc, mpg < 11)
# A tibble: 2 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
2 10.4 8 460 215 3 5.42 17.8 0 0 3 4
> head(mtc)
# A tibble: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
select() and filter() are functions, so they do not modify their input. You can see mtc is unchanged after calling filter() on it. This holds for R functions in general.
To keep the result, assign it to a name using <-
> mtc_low_mpg <- filter(mtc, mpg < 11)
> mtc_low_mpg
# A tibble: 2 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
2 10.4 8 460 215 3 5.42 17.8 0 0 3 4
arrange takes a data frame and a column, and sorts the rows by the values in that column (ascending order).
> powerful <- filter(mtc, hp > 200)
> arrange(powerful, mpg)
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
2 10.4 8 460 215 3 5.42 17.8 0 0 3 4
3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
6 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
> arrange(powerful, gear, disp)
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
2 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
4 10.4 8 460 215 3 5.42 17.8 0 0 3 4
5 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
6 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
> arrange(powerful, desc(mpg))
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
2 15 8 301 335 3.54 3.57 14.6 0 1 5 8
3 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 10.4 8 460 215 3 5.42 17.8 0 0 3 4
> mtc_vars_subset = select(mtc, mpg, hp)
> mutate(mtc_vars_subset, gpm = 1/mpg)
# A tibble: 32 × 3
mpg hp gpm
<dbl> <dbl> <dbl>
1 21 110 0.0476
2 21 110 0.0476
3 22.8 93 0.0439
4 21.4 110 0.0467
5 18.7 175 0.0535
6 18.1 105 0.0552
7 14.3 245 0.0699
8 24.4 62 0.0410
9 22.8 95 0.0439
10 19.2 123 0.0521
# … with 22 more rows
This uses mutate to add a new column that is the reciprocal of mpg.
The thing on the left of the = is a new name that you make up which you would like the new column to be called
The expression on the right of the = defines what will go into the new column
mutate() can create multiple columns at the same time and use multiple columns to define a single new one
> mutate(mtc_vars_subset, # the newlines make it more readable
+ gpm = 1/mpg,
+ mpg_hp_ratio = mpg/hp)
# A tibble: 32 × 4
mpg hp gpm mpg_hp_ratio
<dbl> <dbl> <dbl> <dbl>
1 21 110 0.0476 0.191
2 21 110 0.0476 0.191
3 22.8 93 0.0439 0.245
4 21.4 110 0.0467 0.195
5 18.7 175 0.0535 0.107
6 18.1 105 0.0552 0.172
7 14.3 245 0.0699 0.0584
8 24.4 62 0.0410 0.394
9 22.8 95 0.0439 0.24
10 19.2 123 0.0521 0.156
# … with 22 more rows
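mtc_vars_subset is unchanged after the mutate.
filter() picks out rows according to specified conditions
select() picks out columns according to their names
arrange() sorts the rows by values in some column(s)
mutate() creates new columns, often based on operations on other columns
All verbs work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the column names (without quotes).
The result is a new data frame.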
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
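As a rough sketch of what chaining can look like, the example below runs all four verbs on a small made-up data frame (the name toy and its values are assumptions for illustration). It first saves each intermediate result, then repeats the same steps with the pipe operator %>%, which is attached along with dplyr but is not introduced in these slides.

```r
library(dplyr)

# Small made-up data frame (illustrative values only)
toy <- tibble(
  mpg = c(21, 22.8, 18.7, 14.3),
  hp  = c(110, 93, 175, 245),
  wt  = c(2.62, 2.32, 3.44, 3.57)
)

# Step by step: each verb takes a data frame and returns a new one
light    <- filter(toy, wt < 3.5)            # keep the lighter cars
narrow   <- select(light, mpg, hp)           # keep two columns
with_gpm <- mutate(narrow, gpm = 1 / mpg)    # add gallons per mile
result   <- arrange(with_gpm, desc(hp))      # most powerful first
print(result)

# The same steps written with the pipe: x %>% f(y) means f(x, y)
result_piped <- toy %>%
  filter(wt < 3.5) %>%
  select(mpg, hp) %>%
  mutate(gpm = 1 / mpg) %>%
  arrange(desc(hp))
```

Both versions give the same data frame; the first is easier to inspect step by step, the second avoids naming intermediate results.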
